source("EDA.R")
aqi_test <- aqi_test %>%
mutate(final_cat = factor(final_cat, levels = levels(aqi_train$final_cat)),
pm10_cat = factor(pm10_cat, levels = levels(aqi_train$pm10_cat)),
pm25_cat = factor(pm25_cat, levels = levels(aqi_train$pm25_cat)),
no2_cat = factor(no2_cat, levels = levels(aqi_train$no2_cat)),
so2_cat = factor(so2_cat, levels = levels(aqi_train$so2_cat)),
o3_cat = factor(o3_cat, levels = levels(aqi_train$o3_cat)),
co_cat = factor(co_cat, levels = levels(aqi_train$co_cat)))
All the EDA work is loaded.
library(tidyverse)
library(leaps)
library(tree)
library(knitr)
library(kableExtra)
library(ggdendro)
library(clustMixType)
library(randomForest)
library(keras)
library(ggmap)
library(maps)
library(e1071)
library(kernlab)
library(onehot)
All the necessary packages have been loaded.
Citation:
Moch, J. (n.d.). Air Quality Index Archive based on ground monitors in China. Retrieved April 13, 2018, from http://aqi.cga.harvard.edu/china/about/ conducted by Harvard John A. Paulson School of Engineering and Applied Sciences
Direct link to the datasets: http://aqi.cga.harvard.edu/china/ <- then click on aqi.csv
From my DataMemo: In general, air quality is influenced by various factors such as weather, and weather forecast (prediction) is done using a lot of quantitative data of the current state. Therefore, my project is not about making predictions, but more about making inferences and learning how air quality is determined using the dataset I have. Also, I am planning to use the average “Air Quality Category”, described in EDA, as my respondent variable. So my project will be a classification type. Classification and Regression Tree (CART) will (most likely) be used extensively for this project.
Air pollution has started to become much more serious environmental problem in South Korea lately. One major cause of air pollution is the high population density of Korea. South Korea is 0.7 times as big as Illinois, and yet the population of Korea is more than 4 times greater than the population of Illinois. In fact the population of capital city Seoul is around 10.3 million; Seoul’s population density is almost twice that of New York City, four times higher than Los Angeles.
Also, due to natural environmental forces such as wind that blows from China, another country that also suffers a lot from air pollution, the air quality in the Korean peninsula becomes ineffably worse. Therefore, South Korean government is working hard to improve air quality; for example one strategy was to offer free public transportation to limit car use: https://www.straitstimes.com/asia/east-asia/free-public-transport-in-seoul-amid-thick-smog. However, none of the strategies has proven to be very effective.
Below is a picture showing the obnoxious air quality of Korea when especially large amount of thick smog and pollutants have flown to the Korean peninsula from China by wind.
The United States Environmental Protection Agency (EPA) keeps track of 6 major air pollutants:
PM2.5 stands for particulate matter 2.5 micrometers or less in diameter. Particulate Matters are the sum of all solid and liquid particles suspended in air many of which are hazardous. This complex mixture includes both organic and inorganic particles, such as dust, pollen, soot, smoke, and liquid droplets.
The following table summarizes the detrimental health effects that each pollutants can cause on different groups of people.
I will perform the same initial exploration of my training set that I performed in my EDA. When AQI score is not computed (and simply raw scaled concentrations are used), we see that PM2.5 seems to be the most problematic pollutant; it seems to have the highest scaled concentration.
#training set; pollutants by hour
aqi_train %>%
mutate(
sc_pm25 = pm25 / sd(pm25, na.rm = T),
sc_pm10 = pm10 / sd(pm10, na.rm = T),
sc_o3 = o3 / sd(o3, na.rm = T),
sc_no2 = no2 / sd(no2, na.rm = T),
sc_so2 = so2 / sd(so2, na.rm = T),
sc_co = co / sd(co, na.rm = T)
) %>%
group_by(hour) %>%
summarise( pm25 = mean(sc_pm25, na.rm = TRUE),
pm10 = mean(sc_pm10, na.rm = TRUE),
o3 = mean(sc_o3, na.rm = TRUE),
no2 = mean(sc_no2, na.rm = TRUE),
so2 = mean(sc_so2, na.rm = TRUE),
co = mean(sc_co, na.rm = TRUE)
)%>%
gather(pm25, pm10, o3, no2, so2, co, key = pollutants, value = scaled_mean) %>%
ggplot( aes( x = hour ))+
geom_bar( aes(y = scaled_mean, fill = pollutants), stat = "identity", position = "dodge")
#By pollutants
# grouping by pollutants
poll_stat <- aqi_train %>%
select(pm25, pm10, o3, no2, so2, co) %>%
gather() %>%
group_by(key) %>%
summarise(min = min(value, na.rm = TRUE),
max = max(value, na.rm = TRUE),
rng = max - min,
mean = mean(value, na.rm = TRUE), #center
median = median(value, na.rm = TRUE),
sd = sd(value, na.rm = TRUE), #spread
iqr = IQR(value, na.rm = TRUE))
poll_stat %>%
mutate(scaled = mean / sd,
sdup = scaled + 1.96, #95% CI
sddw = scaled - 1.96) %>%
ggplot(aes(key)) +
geom_point(aes(y = scaled, color = key)) +
geom_point(aes(y = sdup), pch = 2) + # upper bound for 95% CI
geom_point(aes(y = sddw), pch = 6) + # lower bound for 95% CI
labs(title = "Scaled Average Concentrations",
x = "Key Pollutants",
y = "Scaled Concentration",
caption = paste("since this shows *scaled* concentration, each
pollutant's concentration can be compared to one another")) +
geom_label_repel(aes(label = key, y = scaled, color = key),
data = poll_stat_helper,
segment.color = NA,
size = 2.5) +
theme_bw() +
theme(legend.position = "none")
The AQI score conversion from raw concentration is simple. First of all, we need to know in which air-quality-category our raw concentrations fall into. The table below uses 6 air quality categories:
I used similar shorter-name (to write shorter code) categories:
For each pollutant’s concentration, find the box that the concentration falls in and then move across to the right to see approximately what the corresponding AQI score would be and which category the concentration falls into.
The actual AQI score conversion is all about finding the concentration breakpoints of each raw concentration and then using the formula below. Long functions were written in EDA to convert every raw concentration into AQI scores.
The final AQI score is the highest AQI score value calculated for each pollutant. For example, in the figure below, I put stars next to the AQI scores that are used as the final AQI scores
After the raw concentrations have been converted to the AQI score, we see that the most problematic pollutant seems like \(O_3\), and not PM2.5.
aqi_train %>%
ggplot(aes(reorder(max_pollut, max_aqi, median), max_aqi)) +
geom_point(aes(col = final_cat)) +
geom_boxplot(size = 0.05, alpha = 0.3)+
theme_bw()+
theme(legend.position = "none")
aqi_train %>%
ggplot(aes(final_cat))+
geom_bar(aes(fill = max_pollut), , position = "dodge")+
ylim(c(0, 600)) +
coord_flip() +
theme_bw()
So why does \(O_3\), when converted to AQI scores, turn out to be such a problematic pollutant? According to Table 3 under the section Pollutants that describes which specific groups of people are affected by each pollutant, we can see that \(O_3\) pollutant seems to affect the largest number of different groups. I think that is the reason why ozone’s concentrations end up becoming the most frequently used determinant of the final AQI score.
This is a map visualization of the entire dataset (training & validation sets). We see that most of the data is gathered from Eastern China, and also majority of the AQI scores fall into the very bad category. All these categories were determined after AQI conversions were completed.
# --------getting the map
china <- get_map("china", maptype = "roadmap", zoom = 4) %>%
ggmap()
# china
#
# --------putting points on the map
# china +
# geom_point(aes(x = 117, y = 40), #one point
# color = "red", size = 3)
# china +
# geom_point(aes(x = longitude, y = latitude), #all the points
# data = aqi_train)
china +
geom_point(aes(x = longitude, y = latitude , col = final_cat),
data = na.omit(aqi)) +
scale_color_manual(values=c("yellow", "orange", "red", "purple", "black", "black")) +
theme(legend.position = "top")
Because the final AQI is determined by the maximum AQI score (there is a formula to compute the final AQI score), perhaps it is not very meaningful to perform the best subset selection. However, I thought it would be nice to see if statistical learning mechanism also agrees with my dataset above that \(O_3\) is the most frequently used pollutants for determining the final AQI score. Also, I was curious to see if any of the non-pollutant variables (especially, atmospheric variables) had any relatively strong correlation with the maximum/final AQI score.
As you can see from the result below, \(O_3\) is the first pollutant to be added for best subset and forward seleciton; and it is also the last pollutant to remain in the backward selection.
Note that all six pollutants as well as all the atmospheric variables such as pressure, wind, temp, dewpoint, and humidity were included. But as you can see from the results below, the pollutants are always prioritized first. It is also interesting to see that humidity is selected first for the best subset and forward seleciton, and discarded last for the backward selection among the non-pollutant variables.
regfit_full <- aqi_train %>%
regsubsets(max_aqi~aqi_pm10+aqi_pm25+aqi_no2+aqi_so2+aqi_o3+aqi_co+ #pollutants
pressure+wind+hour+temp+dew+humid, #atm_var
data = .,
nvmax = 7) #to return maximum 12 variables
reg_summary <- regfit_full %>%
summary()
#into a nice table format
#notice aqi_o3 is selected first
#notice no atmospheric variables are selected before all the pollutants are selected
#but humid is selected first among the atmospheric variables
reg_summary$outmat %>%
data.frame() %>%
rownames_to_column(var = "num_variables") %>%
kable("html") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE, font_size = 10)
| num_variables | aqi_pm10 | aqi_pm25 | aqi_no2 | aqi_so2 | aqi_o3 | aqi_co | pressure | wind | hour | temp | dew | humid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 ( 1 ) |
|
|||||||||||
| 2 ( 1 ) |
|
|
||||||||||
| 3 ( 1 ) |
|
|
|
|||||||||
| 4 ( 1 ) |
|
|
|
|
||||||||
| 5 ( 1 ) |
|
|
|
|
|
|||||||
| 6 ( 1 ) |
|
|
|
|
|
|
||||||
| 7 ( 1 ) |
|
|
|
|
|
|
|
regfit_fwd <- aqi_train %>%
regsubsets(max_aqi~aqi_pm10+aqi_pm25+aqi_no2+aqi_so2+aqi_o3+aqi_co+ #pollutants
pressure+wind+hour+temp+dew+humid, #atm_var
data = .,
nvmax = 7,
method = "forward") #forward selection
regfit_fwd %>%
summary() %>%
.$outmat %>%
data.frame() %>%
rownames_to_column(var = "num_variables") %>%
kable("html") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE, font_size = 10) #Again all the pollutants first, and then humid
| num_variables | aqi_pm10 | aqi_pm25 | aqi_no2 | aqi_so2 | aqi_o3 | aqi_co | pressure | wind | hour | temp | dew | humid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 ( 1 ) |
|
|||||||||||
| 2 ( 1 ) |
|
|
||||||||||
| 3 ( 1 ) |
|
|
|
|||||||||
| 4 ( 1 ) |
|
|
|
|
||||||||
| 5 ( 1 ) |
|
|
|
|
|
|||||||
| 6 ( 1 ) |
|
|
|
|
|
|
||||||
| 7 ( 1 ) |
|
|
|
|
|
|
|
regfit_bwd <- aqi_train %>%
regsubsets(max_aqi~aqi_pm10+aqi_pm25+aqi_no2+aqi_so2+aqi_o3+aqi_co+ #pollutants
pressure+wind+hour+temp+dew+humid, #atm_var
data = .,
nvmax = 7,
method = "backward") #backward selection
regfit_bwd %>%
summary() %>%
.$outmat %>%
data.frame() %>%
rownames_to_column(var = "num_variables") %>%
kable("html") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE, font_size = 10) #Again all the pollutants first, and then humid
| num_variables | aqi_pm10 | aqi_pm25 | aqi_no2 | aqi_so2 | aqi_o3 | aqi_co | pressure | wind | hour | temp | dew | humid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 ( 1 ) |
|
|||||||||||
| 2 ( 1 ) |
|
|
||||||||||
| 3 ( 1 ) |
|
|
|
|||||||||
| 4 ( 1 ) |
|
|
|
|
||||||||
| 5 ( 1 ) |
|
|
|
|
|
|||||||
| 6 ( 1 ) |
|
|
|
|
|
|
||||||
| 7 ( 1 ) |
|
|
|
|
|
|
|
Our single classification tree also tells us that the AQI score of \(O_3\) is the most important factor in determining the final category of the air quality because the top three internal nodes are occupied by the AQI score of \(O_3\).
tree_aqi_train <- tree(final_cat ~ aqi_pm10+aqi_pm25+aqi_no2+aqi_so2+aqi_o3+aqi_co+
pressure+wind+hour+temp+dew+humid,
data = aqi_train)
plot(tree_aqi_train)
text(tree_aqi_train, pretty = 0, cex = 0.7) #notice that the AQI score of o3 occupies the top three internal nodes
10-fold Cross-valiation of the single classfication tree
set.seed(3)
cv_tree <- cv.tree(tree_aqi_train, FUN = prune.misclass)
tibble(size = cv_tree$size, k = cv_tree$k, cv_error = cv_tree$dev) %>%
ggplot(aes(x = size, y = cv_error)) +
geom_point() +
geom_line() +
geom_vline(xintercept = cv_tree$size[cv_tree$dev == min(cv_tree$dev)],
color = "royalblue") + # 9 leaves
theme_bw()
Pruned Tree
A tree with 9 leaves.
tree_pruned <- prune.misclass(tree_aqi_train, best = 9)
plot(tree_pruned)
text(tree_pruned, pretty = 0, cex = 0.5)
Confusion Table with pruned tree, using the validation set
tree_pred <- predict(tree_pruned, aqi_test, type = "class")
table(tree_pred, aqi_test$final_cat) %>%
table_nice()
| very good | good | moderate | bad | very bad | dangerous | toxic | |
|---|---|---|---|---|---|---|---|
| very good | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| good | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| moderate | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| bad | 0 | 0 | 2 | 25 | 2 | 1 | 0 |
| very bad | 0 | 0 | 0 | 0 | 134 | 7 | 1 |
| dangerous | 0 | 0 | 0 | 0 | 0 | 45 | 4 |
| toxic | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
Correct rate (from the above confusion table)
mean(tree_pred == aqi_test$final_cat, na.rm = T) %>% kable("html") %>%
kable_styling(full_width = FALSE, position = "left") %>%
add_header_above(c("correct rate" = 1)) # 0.9244444
| x |
|---|
| 0.9244444 |
set.seed(3)
aqi_train_pollut <- na.omit(aqi_train) %>%
select(contains("aqi"), max_pollut, final_cat) %>%
mutate(final_cat = factor(final_cat)) #this step is necessary to make the subsequent randomForest() function to work. Otherwise there will be no data corresponding to the category "very good" and "good"; then randomForest will produce an error
aqi_rf <- randomForest(final_cat ~ aqi_pm10+aqi_pm25+aqi_no2+aqi_so2+aqi_o3+aqi_co,
data = aqi_train_pollut,
mtry = 3,
importance = TRUE)
aqi_rf_pred <- predict(aqi_rf, newdata = aqi_test, type = "class")
# mean(aqi_rf_pred != aqi_test$final_cat) #again this is caused by the fact that not all categories are included in aqi_train_pollut, namely "very good" and "good"
Variable Importance
The left column/graph is the mean decrease of accuracy in predictions on the out of bag samples when a given variable is exluded from the model. The right is a measure of the total decrease in node impurity that results from splits over that variable, averaged over all trees. The results indicate that the AQI score of \(O_3\) and then PM2.5 are by far the two most important variables.
importance(aqi_rf) %>% as_tibble() %>%
mutate(pollutant = c("aqi_pm10", "aqi_pm25", "aqi_no2", "aqi_so2", "aqi_o3", "aqi_co")) %>%
.[,c(8,6:7)] %>%
arrange(desc(MeanDecreaseAccuracy)) %>% table_nice()
| pollutant | MeanDecreaseAccuracy | MeanDecreaseGini |
|---|---|---|
| aqi_o3 | 229.446476 | 213.725130 |
| aqi_pm25 | 77.497696 | 58.735804 |
| aqi_co | 45.821765 | 22.093012 |
| aqi_so2 | 43.183839 | 22.364230 |
| aqi_pm10 | 21.283933 | 30.781827 |
| aqi_no2 | 6.237022 | 7.181354 |
varImpPlot(aqi_rf)
atm_variables <- c("pressure", "wind", "hour", "temp", "dew", "humid")
pca <- na.omit(aqi_train) %>% #missing values are gone
select(contains("aqi"), - max_aqi) %>%
prcomp(scale = TRUE)
pca$rotation %>%
table_nice()
| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | |
|---|---|---|---|---|---|---|
| aqi_pm25 | -0.5015709 | 0.0909083 | -0.4234672 | -0.1602461 | 0.0845494 | -0.7266433 |
| aqi_pm10 | -0.4852655 | 0.1482276 | -0.4766389 | -0.1352933 | 0.1789547 | 0.6819327 |
| aqi_o3 | 0.1589983 | 0.9217726 | -0.0596197 | 0.3382312 | -0.0726263 | -0.0427251 |
| aqi_no2 | -0.4203251 | 0.1836324 | 0.4261280 | -0.3665266 | -0.6850824 | 0.0658872 |
| aqi_so2 | -0.3988459 | 0.1345927 | 0.6390234 | 0.0680871 | 0.6398330 | -0.0208267 |
| aqi_co | -0.3897594 | -0.2612728 | 0.0005944 | 0.8382356 | -0.2771784 | 0.0188940 |
Biplot, using the first 2 principal component analysis
The first loading vector places approximately equal weight on \(NO_2\), \(SO_2\), PM2.5, and PM 10. The second loading vector places most of its weight on \(O_3\) and also slightly weakly on CO. Once again, it is interesting to see how \(O_3\) seems to stand out from the rest of the pollutants.
ggbiplot::ggbiplot(pca, scale = 0, alpha = 0.1)
Scree Plot
Proportion of variance explained each principal components
#variance for each component
pca_var <- pca$sdev^2
#Proportion of variance explained by each principal component
pve <- pca_var/sum(pca_var)
tibble(x = 1:6, y1 = pve, y2 = cumsum(pve)) %>%
ggplot(aes(x)) +
ylim(c(0,1)) +
geom_point(aes(y = y1), size = 3) +
geom_line( aes(y = y1) ) +
geom_point(aes(y = y2), size = 3) +
geom_line( aes(y = y2), color = "royalblue") +
xlab("Principal Component") +
ylab("Proportion of Variance Explained")
Performing 3-Means Clustering resulted in producing 3 groups with the following characteristics. Interestingly, we can see that the 3 clusters seem to be most differentiated by the AQI score of \(O_3\).
aqi_train_dat <- na.omit(aqi_train) %>% #omitting all NA's
select(stationname, contains("aqi"), final_cat, - max_aqi, longitude, latitude) %>%
as.data.frame()
set.seed(3)
aqi_kproto <- kproto(aqi_train_dat[,2:8], k = 3, nstart = 25)
par(mfrow=c(1,4))
clprofiles(aqi_kproto, aqi_train_dat)
aqi_train_dat <- aqi_train_dat %>%
mutate(cluster = factor(aqi_kproto$cluster)) %>%
as_tibble()
#Create a table showing a sample of colleges from each cluster
set.seed(3)
tibble(Cluster_1 = aqi_train_dat %>%
filter(cluster == 1) %>%
sample_n(15) %>%
pull(stationname),
Cluster_2 = aqi_train_dat %>%
filter(cluster == 2) %>%
sample_n(15) %>%
pull(stationname),
Cluster_3 = aqi_train_dat %>%
filter(cluster == 3) %>%
sample_n(15) %>%
pull(stationname)) %>%
kable("html") %>%
kable_styling(full_width = FALSE, position = "left") %>%
add_header_above(c("Sample of areas from each cluster" = 3))
| Cluster_1 | Cluster_2 | Cluster_3 |
|---|---|---|
| Weihai | Gaoping District Monitoring Station, Nanchong | Qujiang Cultural Industry Group, Xian |
| Yangjiang | No.1 middle school, Jincheng | Shunyi New Town |
| Helanshan Maliankou, Yinchuan | Cixi Experimently Primary School | Baoji |
| Yuanyang Lake, Yangjiang | Fengcheng Subdistrict, Qingyuan | Chaoyang Agricultural Exhibition Hall |
| Cultural Square, Mudanjiang | City Monitoring Station, Jinan | Dongshan mills, Hegang |
| Pingfang Light Factory, Haerbin | Quancheng Plaza, Jinan | Qinghe Primary School, Jiaxing |
| City Station, Wenzhou | City Stadium, Xian | Shinan district No. 1, Qingdao |
| Dongcheng Shijing, Dongguan | Sanming | Mangdangshan, Nanping |
| Shenfu new town, Fushun | Laiwu | Yanliang District, Xian |
| Hunan University of Traditional Chinese Medicine, Changsha | Yutian County , Tangshan | Zhangjiakou |
| Huairou town | Customs, Liaocheng | Woodworking Factory, Weihai |
| Veterinary Station, Jingdezhen | Guyuan County welfare , Zhangjiakou | No. 1 hospital, Mudanjiang |
| Shangrao | Tanggu Yingkou street, Tianjin | Konggang, Chongqing |
| Taoyuan, Taiyuan | Qingyuan City Hall, Baoding | Huaxi District, Guiyang |
| Quzhou University, Quzhou | Dunkou Xinou | Henan University No.1 Hospital, Kaifeng |
Map Visualization by cluster (in different colors)
I was originally hoping to see geographical separation among the three clusters. However, as you can see from the map visualization below, they do not seem to be separated by location.
china +
geom_jitter(aes(x = longitude, y = latitude, col = cluster),
data = aqi_train_dat) +
scale_color_manual(values=c("red", "blue", "black")) +
theme(legend.position = "top")
SVM are intended for the binary classification setting in which there are two classes. Therefore I decided simplify the 6 categories into 2 categories. The categories “very bad”, “dangerous”, “toxic” forms one category and the rest forms another category. However, as you can see below from the map, the majority of the points overlap one another, so I knew that support vector classifier (linear kernel) will be almost useless in this dataset. Also, the polynomial kernel proved to be not helpful and computationally too expensive (code is contained inside the rmd file; it is just not evaluated). Only the radial kernel have shown to be somewhat useful.
#divide into 2 groups for final_cat
#for train
aqi_train_svm <- aqi_train %>%
na.omit() %>% #NA's are omitted
mutate(two_cat = factor( final_cat %in% c("very bad", "dangerous", "toxic")))%>%
select(two_cat, latitude, longitude)
#for test
aqi_test_svm <- aqi_test %>%
na.omit() %>% #NA's are omitted
mutate(two_cat = factor( final_cat %in% c("very bad", "dangerous", "toxic")))%>%
select(two_cat, latitude, longitude)
china +
geom_jitter(aes(x = longitude, y = latitude, col = two_cat),
data = aqi_train_svm)
Support Vector Classifier
10-fold cross validation was used for determining the optimal cost. As you can see, linear kernel is not helpful in retrieving any kind of useful information.
set.seed(3)
tune_lin <- tune(svm, two_cat ~ .,
data = aqi_train_svm, kernel = "linear",
ranges = list(cost = c(0.1, 0.5, 1, 5, 10, 50, 100)),
scale = TRUE)
# tune_lin %>% summary()
lin_best_mod <- tune_lin$best.model
par(mar=c(1,1,1,1))
plot(lin_best_mod, aqi_train_svm, latitude ~ longitude)
Support vector Machines
set.seed(3)
tune_rad <- tune(svm, two_cat ~ .,
data = aqi_train_svm,
kernel = "radial",
ranges = list(cost = c(0.1, 1, 10, 100, 1000), #giving several values for cost
gamma = c(0.5, 1, 2, 3, 4)),
scale = TRUE)
#tune_rad %>% summary()
rad_best_mod <- tune_rad$best.model
plot(rad_best_mod, aqi_train_svm, latitude ~ longitude,
color.palette = terrain.colors) +
lines(y = china1$y, x=china1$x*0.8+15, type = "l")
## integer(0)
pred_svm <- predict(rad_best_mod, na.omit(aqi_test_svm))
Confusion Matrix with the validation set
table(pred_svm, na.omit(aqi_test_svm)$two_cat) %>%
table_nice()
| FALSE | TRUE | |
|---|---|---|
| FALSE | 3 | 6 |
| TRUE | 8 | 101 |
Correct rate
#Correct rate
mean(pred_svm == na.omit(aqi_test_svm)$two_cat) %>% table_nice #0.8813559
| x |
|---|
| 0.8813559 |
Since we have 7 classes, this problem is an instance of “multi-class classification”, and since each observation should be classified into only one category, the problem is an instance of single-label, multi-class classification.
#set.seed(3) # for keras set.seed(3) is NOT enough for reproducibility
use_session_with_seed(3) # for reproducibility
#Implement one hot coding
ohe_rules <- aqi %>% na.omit() %>% #NA's are discarded
select(aqi_pm10, aqi_pm25, aqi_no2, aqi_so2, aqi_o3, aqi_co) %>% #only the pollutants' AQI scores
onehot(., max_levels = 30)
#using predict() to create one hot encoded data sets
x_train <- aqi_train %>% na.omit() %>%
predict(ohe_rules, data = .)
x_test <- aqi_test %>% na.omit() %>%
predict(ohe_rules, data = .)
#Train and Test targets
one_hot_train_labels <- aqi_train %>% na.omit() %>% #NA's are discarded
pull(final_cat) %>% to_categorical() %>% .[,-1]
one_hot_test_labels <- aqi_test %>% na.omit() %>% ##NA's are discarded
pull(final_cat) %>% to_categorical() %>% .[,-1]
#Standardize - using the statistics of the training set
means_train_dat <- apply(x_train, 2, mean)
std_train_dat <- apply(x_train, 2, sd)
train_data <- scale(x_train,
center = means_train_dat,
scale = std_train_dat)
test_data <- scale(x_test,
center = means_train_dat,
scale = std_train_dat)
#Building our network
model <- keras_model_sequential() %>%
layer_dense(units = 128, activation = "relu",
input_shape = dim(train_data)[[2]]) %>%
layer_dense(units = 16, activation = "relu") %>%
layer_dense(units = 16, activation = "relu") %>%
layer_dense(units = 7, activation = "softmax")
model %>%
compile(
optimizer = "rmsprop",
loss = "categorical_crossentropy",
metrics = c("accuracy")
)
#Validating our approach
val_indices <- sample(dim(train_data)[1], 323) # using 50% of the train_data for validation set
x_val <- x_train[val_indices,]
partial_x_train <- x_train[-val_indices,]
y_val <- one_hot_train_labels[val_indices,]
partial_y_train <- one_hot_train_labels[-val_indices,]
history <- model %>% fit(
partial_x_train,
partial_y_train,
epochs = 80,
batch_size = 68, # smaller the batch the less accurate estimate of the gradient bur requires less memory and Typically networks trains faster with mini-batches. That's because we update weights after each propagation
validation_data = list(x_val, y_val)
)
plot(history)
#predictions for all of the test data
predictions <- model %>% predict(x_test)
#The coefficients in this vector sum to 1:
sum(predictions[4,])
sum(predictions[118,])